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for  Physical  Database  Design 

R.  H.  Katz  and  E.  Wong 
University  of  California,  Berkeley 
Berkeley,  CA.  94720 


ABSTRACT;  Design  and  Access  Path  Data  Models  are  presented 
to  form  an  integrated  framework  for  logical  and  physical 
database  design  in  a  heterogeneous  database  environment. 
This  paper  focuses  on  the  physical  design  process.  First,  a 
physical  design  is  specified  in  terms  of  general  properties 
of  access  paths,  independent  of  implementation  details. 
Then,  a  physical  design  is  realized  by  mapping  the  specifi¬ 
cation  into  the  storage  structures  of  a  particular  database 
system.  Algoritnms  for  assigning  the  properties  to  logical 
access  paths  and  for  realizing  a  CODASYL  78  DBTG  schema  are 

given. 

J  r 

1.  Introduction 


As  the  trend  towards  distributed  database  systems  continues 
to  gain  in  momentum,  the  problem  of  database  design  in  a 
heterogeneous  environment  is  becoming  crucial.  We  view  a 
distributed  database  system  as  being  built  on  top  of  exist¬ 
ing  systems  available  at  the  local  sites  of  a  computer  net¬ 
work.  If  a  distributed  database  is  to  evolve  naturally, 
there  must  be  support  for  extending  it  to  the  underlying 
heterogeneous  systems. 


Database  design  is  complicated  by  the  difficulty  in 
designing  pnysical  databases  for  a  variety  of  storage  struc¬ 
tures  supported  by  the  different  underlying  systems.  We  fol¬ 
low  [CARD75]  in  partitioning  the  physical  design  process 
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into  its  implementation-oriented  and  implementation- 
dependent  aspects.  A  physical  design  is  specified  in  terms 
of  basic  concepts  of  storage  structure  without  making  a  com¬ 
mitment  to  an  actual  implementation.  A  physical  design  is 
realized  by  mapping  the  system  independent  specification 
into  the  storage  structures  available  in  a  particular  data¬ 
base  system.  Analytic  methods,  such  as  [CARD73 ,  GOTL74, 
CARD75 ,  YA075 ,  SILE76,  DUHN78 ,  SCHK78]  ,  can  be  used  in  the 
latter  step.  Our  contribution  is  to  provide  an  integrated 
framework  for  logical  and  physical  design,  and  to  provide 
design  tools  with  a  high  degree  of  independence  from  the 
underlying  data  models  and  systems. 

In  tnis  paper,  we  propose  the  concept  of  an  access  path 
data  model  for  pnysical  design.  The  access  path  model  has 
grown  out  of  the  attempts  to  extend  our  work  with  the  Design 
Model  [WONG78, WONG79]  to  problems  of  pnysical  database 
design.  The  term  "data  model"  is  used  in  a  generic  sense  to 
mean  a  collection  of  data  object  types,  sucn  as  attributes 
and  relations  in  the  relational  model.  "Schema"  is  used  to 
mean  a  specific  choice  of  data  objects  to  represent  a  data¬ 
base,  such  as  a  specific  choice  of  relations  and  associated 
attr ibutes . 

The  access  path  model  can  be  viewed  as  an  interface 
between  the  logical  view  of  data  and  the  access  methods  and 
storage  structures  chosen  to  support  that  view.  In  terms  of 
the  language  of  the  ANSI/X3/SPARC  report  (TSIC77] ,  it  medi- 
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ates  between  the  conceptual  and  internal  schemas. 

We  are  not  the  first  to  exploit  the  usefulness  of  an 
access  path  model.  The  DIAM  (Data  Independent  Access  Model) 
framework  [SENK73]  is  structured  into  four  levels  consisting 
of  entity  set,  string,  encoding,  and  physical  device  models. 
The  string  model  is  most  clearly  associated  with  our  notion 
of  access  path  model.  Although  the  DIAM  model  is  a  signifi¬ 
cant  contribution,  we  believe  that  our  formulation  of  access 
path  is  more  natural  and  easy  to  understand.  In  addition, 
the  access  path  schema  is  oriented  towards  the  problem  of 
physical  design,  rather  than  a  general  model  of  data  manage¬ 
ment  systems. 

The  paper  is  organized  as  follows.  A  semantic  data 
model  is  presented  which  is  the  basis  for  our  approach  to 
database  design.  Logical  access  paths  are  represented  by 
functional  interrelationships  between  objects.  The  access 
path  model  is  defined  to  capture  those  functions  which  can 
be  used  to  efficiently  access  objects  in  the  physical  reali¬ 
zation  of  the  database.  A  methodology  for  specifying  an 
implementation-oriented  physical  design  is  given  which  is 
based  on  assigning  the  highest  level  support  for  the  most 
frequently  traversed  access  paths.  A  simple-minded  approach 
for  mapping  a  design  specification  into  the  storage  struc¬ 
tures  of  CODASYL  DBTG  systems  is  included.  We  conclude  the 
paper  with  a  discussion  on  future  directions. 

2.  The  Design  Model 
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The  design  model  is  the  starting  point  for  our  approach 
to  database  design.  It  has  been  formulated  to  capture  the 
kinds  of  integrity  constraints  supported  by  the  relational 
and  DBTG  models,  yet  remains  independent  of  them.  The  model 
is  based  on  a  variation  of  the  entity-relationship  model 
[CHEN76J  and  has  been  influenced  by  the  semantic  data  model 
of  [SCHM75] .  A  more  complete  discussion  of  the  design  model 
and  its  application  to  logical  design  and  schema  conversion 
can  be  found  in  [WONG79] . 

For  each  instance  of  time  t,  let  E^(t),  E^tt),  ...  , 

£n(t)  be  n  distinct  sets,  which  are  called  entity  sets .  A 
property  of  an  entity  set  E(t)  is  a  one-parameter  family  of 
functions  ffc,  mapping  at  each  t  E(t)  into  a  set  V  of  values. 
Because  ffc  is  defined  for  every  element  of  the  domain,  it  is 
a  total  function.  As  an  example,  consider  the  following 
entity  sets  and  properties: 

entity  sets  properties 

emp  ename ,bir thyr 

dept  dname, location 

job  title , status , salary 

A  relationship  Rfc  among  entity  sets  E^(t),  E2<t),  ...  , 
En(t)  is  a  subset  of  tne  cartesian  product  E^(t)  X  E2<t)  X 
...  X  En(t)  at  each  time  t.  Properties  of  relationships  may 
be  defined  in  an  analogous  way  to  properties  of  entity  sets. 
Relationships  are  assumed  to  be  independent,  i.e.  not  deriv¬ 
able  from  otner  relationships,  and  indecomposible ,  i.e.  not 

equal  to  the  join  of  their  projections  into  subrelation- 
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ships.  For  example,  the  following  two  relationships  specify 
the  employees  qualified  to  hold  each  job,  and  the  job  allo¬ 
cated  to  a  given  department.  "Number  allocated"  may  be 
specified  as  a  property  of  "allocation": 

relationships  properties 

qualified (job,emp)  - 

allocation (dept , job)  number 

We  further  distinguish  the  types  of  relationships 
recognized  by  the  design  model.  A  binary  relationship  Rfc  on 
entity  sets  E^(t)  and  E 2 ( t )  is  single-valued  in  E^(t)  if 
each  entity  of  E^(t)  occurs  in  at  most  one  instance  of  Rt. 
Intuitively,  we  may  think  of  R^  as  representing  a  function 
from  E^(t)  into  E^(t) ,  because  each  entity  in  E^(t)  can  be 
related  to  no  more  than  one  entity  in  E2(t).  If  each  entity 
in  E^(t)  occurs  in  exactly  one  instance  of  Rfc ,  Rfc  is  called 
an  assoc iation .  We  may  think  of  Rfc  as  representing  a  total 
function.  Single-valued  relationships  which  are  not  associa¬ 
tions  can  be  thought  of  as  partial  functions,  because  at  a 
given  point  in  time,  the  function  need  not  be  defined  over 
all  entities  in  E^(t).  Associations  are  used  to  model  the 
situation  in  which  the  domain  object  can  exist  only  if  it  is 
related  to  some  range  object.  If  an  object  in  the  range  of 
an  association  is  deleted,  then  the  objects  in  the  domain  no 
longer  occur  in  an  instance  of  Rfc.  They  must  be  deleted  to 
maintain  the  totality  of  the  function.  Examples  of  associa¬ 
tions  include: 

works-in  (emp,dept) 


x 
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assignment  (emp,job) 

which  represent  the  facts  that  an  employee  must  work-in  some 
department  at  all  times  and  must  be  assigned  to  some  job  at 
all  times.  An  example  of  a  single-valued  relationship  which 
is  not  an  association  is: 

mgr  (dept,emp) 

which  associates  a  managing  employee  with  each  department, 
although  a  department  can  exist  without  a  manager. 

Explicit  provisions  for  value  set  definitions  have  been 
omitted  in  our  model.  A  subsystem  sucn  as  that  proposed  in 
[MCLE76]  could  be  included,  but  it  is  unlikely  that  existing 
systems  could  support  sophisticated  domain  definition.  A 
simpler  approach  is  to  use  the  primitive  data  types  sup¬ 
ported  by  most  systems  for  the  domain  definition  (e.g., 
integer,  char (10;,  etc.). 

Our  design  model  can  be  reformulated  to  represent  logi¬ 
cal  access  paths  in  terms  of  total  and  partial  functions 
between  objects.  This  is  similar  to  the  approach  taken  in 
the  functional  data  models  of  [SHIP79]  and  [BUNE79]  .  The 
objects  of  the  schema  are  the  value  sets,  entity  sets,  and 
relationships.  Single-valued  relationships  are  partial  func¬ 
tions,  while  associations  and  properties  are  total  func¬ 
tions.  In  addition,  total  functions  can  be  defined  to  map  a 
relationship  object  into  the  entity  set  objects  which  parti- 
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The  above  example  is  reproduced  here  in  terms  of  the 
functional  viewpoint  (some  abbreviation  has  taken  place) : 


total  functions 
ename:  emp  — >  char (20) 
birthyr:  emp  — >  integer 
dname:  dept  — >  char (10) 
location:  dept  — >  char (20) 

qual-emp:  qual  — >  emp 

qual-job:  qual  — >  job 


title:  job  — >  cnar(15) 
salary:  job  — >  integer 
works-in:  emp  — >  dept 
assignment:  emp  — >  job 
alloc-dept:  alloc  — >  dept 
alloc-job:  alloc  — >  job 
number:  alloc  — >  integer 


partial  functions 
mgr:  dept  — >  emp 


A  design  schema  can  be  represented  graphically.  Let  I 
=  (V , E )  be  a  directed  graph  with  set  V  of  vertices  and  set  E 
of  edges.  For  each  object  in  the  schema/  there  is  a  vertex 
in  V.  For  each  function  from  object^  to  object2,  there  is  a 
directed  edge  from  the  vertex  for  object^  to  the  vertex  for 
object^.  Value  objects  are  represented  by  black  vertices, 
non-value  objects  by  white.  The  graphical  representation  of 
the  example  schema  is: 
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3.  The  Access  Path  Model 

The  functions  of  the  design  model  represent  logical 
access  paths  that  can  be  used  to  navigate  among  the  objects 
of  the  scnema.  For  example,  WORKS-IN  (ENAME-^  (" fred" ) )  gives 
us  the  department  that  Fred  works  in.  The  access  path  schema 
is  concerned  with  those  functions  and  inverses  that  are 
“supported"  for  efficient  access  by  the  underlying  database 
system.  “Support"  is  used  in  an  operational  sense  to  mean 
that  the  time  to  perform  a  supported  access  is  less  than  the 
time  to  perform  an  unsupported  one. 

When  used  to  access  objects,  logical  access  paths  are 
called  access  mappings .  An  access  mapping  may  be  defined 
for  either  a  function  or  its  inverse.  To  make  it  possible  to 
compose  access  mappings,  we  extend  the  definition  to  allow 
tnem  to  be  defined  over  sets  of  domain  objects.  An  access 
mapping  is  supported  in  the  storage  structure  if  the  data¬ 
base  system  can  efficiently  perform  the  desired  access, 
i.e.,  the  time  to  access  an  object  via  a  supported  access 
map  is  less  than  the  time  to  scan  the  object  set  exhaus¬ 
tively  for  the  desired  object (s).  If  an  access  mapping  is 
not  supported,  it  is  an  unsupported  access  mapping.  Sup¬ 
ported  access  mapping  is  our  terminology  for  the  usual 
notion  of  access  path. 

An  access  path  schema  consists  of  the  objects  of  the 
design  scnema  and  the  supported  access  mappings.  A  graphical 
representation  similar  to  the  one  proposed  in  the  previous 
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section  can  be  used  to  represent  an  access  path  schema.  The 
functional  interrelationships  of  the  design  schema  must  be 
maintained  whether  or  not  those  functions  are  supported.  For 
example,  WORKS-IN  associates  with  each  employee  a  single 
department.  If  WORKS-IN  is  not  supported,  we  must  still  be 
able  to  access  the  associated  department,  albeit  not  as 
efficiently  as  if  WORKS-IN  had  been  supported.  To  accomplish 
this,  we  introduce  the  concept  of  identifier .  An  identifier 
is  a  1  —  to  —  1  property  of  an  entity  set  which  is  used  to 
uniquely  represent  each  entity  in  the  set.  An  unsupported 
access  mapping  between  employees  and  departments  can  be 
represented  instead  as  an  access  mapping  between  employees 
and  the  identifier  value  set  of  department. 

(supported)  WORKS-IN:  emp  — >  dept 

(unsupported)  WORKS-IN:  emp  — >  id  value  set  of  dept 

The  access  path  schema  captures  the  effects  of  storage 
structure  support  without  committing  the  schema  to  a  partic¬ 
ular  implementation  and  without  sacrificing  any  of  the 
interrelationships  of  the  design  schema.  WORKS-IN  can  be 
used  to  navigate  directly  between  employees  and  departments 
only  if  the  mapping  is  supported  by  the  underlying  system. 
It  is  immaterial  whether  this  support  is  furnished  by  a  phy¬ 
sical  pointer  between  employee  records  and  department 
records,  an  index  that  maps  employee  identifiers  into 
department  records,  or  some  other  technique. 

4.  Physical  Database  Design 
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The  access  patn  schema  provides  a  useful  interface 
between  the  user's  logical  view  of  the  data  and  its  physical 
implementation.  In  this  section,  we  will  describe  an 
implements tion-or len ted  physical  design  methodology  whicn  is 
largely  independent  of  the  specific  database  system  and  data 
model.  The  implementation-dependent  aspects  will  be  dis¬ 
cussed  in  section  5. 

The  approach  is  to  generate  designs  which  provide  the 
best  possible  support  for  each  access  path  without  conflict¬ 
ing  with  the  support  for  the  other  access  paths  of  the 
schema.  The  concept  of  maximal  conflict-free  schema  is 
introduced  below.  A  specification  of  the  user's  expected 
access  patterns  is  used  to  direct  the  design  process.  A 
system  specific  mapping  is  then  invoked  to  implement  the 
access  path  schema  in  terms  of  the  storage  structures  avail¬ 
able  in  the  target  system. 

4.1  Algebraic  Structure  for  Physical  Design 

For  the  purposes  of  implementation-oriented  design,  we 
shall  use  the  logical  access  paths  of  the  design  schema.  An 
access  path  schema  may  be  used  to  represent  those  paths 
actually  cnosen  for  support.  Properties  of  an  access  map¬ 
ping  can  be  formulated  to  capture  desirable  characteristics 
of  traversing  the  mapping  in  either  the  functional  or 
inverse  functional  direction.  Consider  the  schema  function 
f:  A  — >  B.  The  following  properties  of  the  mapping  can  be 
defined : 
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(1)  Evaluated :  given  a  in  A,  f(a)  can  be  found  without  an 
exhaustive  scan  of  B,  i.e.,  the  cost  to  access  f ( a )  is 
less  than  the  cost  to  access  each  element  of  B. 

(2)  Indexed ;  given  b  in  B,  f-^  (b)  can  be  found  without  an 
exhaustive  scan  of  A. 

(3)  Clustered:  the  elements  of  f”‘*'(b)  are  in  close  proxim¬ 
ity")  i.e.,  the  cost  to  access  the  elements  in  the 
inverse  is  less  than  the  cost  to  access  an  aroitrary 
subset  of  the  same  cardinality. 

(4)  Well  Placed :  a  and  f(a)  are  stored  in  close  proximity, 
i.e.,  the  cost  to  access  both  is  less  than  the  cost  to 
access  them  separately. 

We  make  tne  critical  assumption  that  each  object  of  the 
schema,  be  it  a  value,  an  entity,  or  a  relationship 
instance,  is  assigned  to  a  single  stored  record.  Replica¬ 
tion,  e.g.,  the  replication  of  data  item  values  to  record 
instances,  will  be  made  explicit  by  introducing  new  objects 
into  the  schema.  The  usual  concept  of  "record"  can  be 
represented  as  a  concatenation  of  tne  stored  records  of  the 
values  that  make  up  the  fields  of  the  record.  Our  approach 
does  not  preclude  the  record  segmentation  and  allocation 
techniques  described  in  [SCHK78] .  Given  this  assumption, 
certain  implication  rules  can  be  formulated: 

(i)  well  placed  ==>  evaluated 

Clearly  it  makes  no  sense  to  place  a  near  its  associated 
range  without  knowing  what  f(a)  is  in  the  first  place. 

(ii)  clustered  ==>  indexed 

Similarly,  it  makes  no  sense  to  place  the  elements  of  f”*(b) 
together  unless  f~^(b)  can  be  found. 
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(iii)  well  placed  ==>  clustered 

Let  b  =  f(a).  Well  placed  means  that  a  and  b  are  stored 
together.  Since  there  is  one  record  for  each  b  instance, 
all  A  objects  with  b  in  the  range  of  f  will  be  placed  near  b 
and  hence  near  each  otner .  Thus  clustering  is  achieved. 

For  systems  without  index  storage  structures,  it  is 
possible  to  have  a  mapping  wnich  is  evaluated  out  not 
indexed.  For  example,  an  employee's  name  may  be  stored  in 
the  same  record  that  represents  the  employee,  with  no 
storage  structures  available  to  access  the  record  via  an 
employee  name.  The  opposite  is  possible  too.  Some  inverted 
file  systems  allow  access  to  a  record  through  a  value  asso¬ 
ciated  with  the  record  that  is  not  accessible  from  that 
record.  For  example,  an  employee's  name  may  not  be  stored 
with  the  record  that  represents  the  employee,  but  an  index 
on  employee  name  is  available.  Thus  evaluated  need  not 
imply  indexed  and  vice  versa. 

The  implication  rules  can  De  used  to  impose  a  partial 
ordering  among  the  properties: 


A  label  is  an  assignment  of  properties  to  an  edge  of  the 
integrity  schema.  There  are  six  distinct  labels:  W,  <C,E>, 
<I,E>,  C,  I,  and  E.  Because  we  wish  to  generate  schemas  with 
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maximally  supported  access  paths  and  because  the  evaluated 
pcoperty  is  independent  of  clustered  and  indexed,  we  concern 
ourselves  only  with  the  first  three  labels  which  are  denoted 
as  "W" ,  "C" ,  and  "I",  i.e.,  we  assume  that  an  access  path  is 
always  evaluated.  A  labelling  is  an  assignment  of  a  laoel 
to  each  edge  of  tne  schema,  denoted  as  an  n-tuple  ( 1^ ,  12, 
...  ,  ln)  where  n  is  the  number  of  edges  in  the  schema.  The 
assignment  is  subject  to  constraints  which  are  shown  below. 
The  partial  ordering  among  properties  induces  a  partial  ord¬ 
ering  among  labels  as  well:  "W"  >  "C"  >  "I".  A  partial  ord¬ 
ering  can  be  defined  for  labellings.  Let  and  be  two 
labellings  over  the  same  schema.  We  say  that  *  l2  if  for 
each  edge  in  the  schema,  ' s  assigned  laoel  is  the  same  as 
L2's  assigned  laDel .  We  say  that  L2  >  if  for  each  edge  in 
the  schema,  either  L^'s  assigned  label  is  the  same  as  L2's 
or  L2's  label  >  L^'s,  and  ^  L2>  Note  that  under  this 
definition,  some  labellings  are  incomparable,  e.g.  = 
("W-V'C")  and  L2  =  ( "C"  ,  "W“  )  . 

An  obvious  approach  to  achieving  a  maximal  labelling  is 
to  assign  "W" ,  the  label  that  represents  the  highest  degree 
of  support,  to  each  edge.  Unfortunately,  certain  labellings 
represent  a  choice  of  properties  which  can  not  be  supported 
simultaneously  within  a  schema.  There  are  four  constraints 
which  conflict-free  labellings  must  meet: 

(i)  cluster  constraint :  it  is  not  possible  to  label  more 

than  one  outedge  of  a  node  with  a  MC"  or 
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"W" .  Clustering  places  together  all 
domain  objects  which  are  mapped  into  the 


same  range  object.  It  is  not  possible  to  partition  the 
domain  on  more  than  one  function  and  still  achieve  this 
advantageous  placement.  Note  that  1-to-l  properties  do  not 
cause  a  conflict  because  a  1-to-l  function  partitions  the 
domain  objects  into  clusters  of  size  one.  This  can  always  De 
supported  regardless  of  additional  clustering. 


(ii)  placement  constraint :  it  is  not  possible  to  label  more 

than  one  inedge  of  a  node  with  "W" . 
Well-placement  places  clusters  of  domain 
objects  with  a  common  range  object  near 
that  range  object.  It  is  not  possible  to  achieve  this  advan¬ 
tageous  placement  simultaneously  for  domain  objects  from 
more  than  one  function. 


(iii)  placement-cluster  constraint :  it  is  not  possible  to 
“VO"  •  simultaneously  label  an  inedge  of  a  node 

^  y  j  "W"  while  labeling  an  outedge  “C“ .  The 

placement  of  X  object  clusters  near 
their  associated  Y  objects  destroys  the  advantageous  clus¬ 
tering  of  the  Y  objects.  1-to-l  functions  do  not  cause  the 
constraint  to  be  violated. 


(iv)  implied  constraints :  Certain  compositions  of  functions 


and  their  properties  result  in  the  vio¬ 
lation  of  one  of  the  above  constraints. 
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For  example,  this  schema  would  cause  a 
violation  of  an  implied  cluster  constraint. 

We  need  not  enforce  all  the  constraints  if  we  allow 
requirements  to  be  specified  which  can  not  be  realized.  The 
resulting  schema  will  not  be  as  well  designed  as  one  for 
which  all  the  constraints  are  met.  For  example,  if  (iii)  is 
not  enforced,  some  of  the  desirable  properties  of  clustering 
for  object  Y  will  be  lost. 

The  degree  of  a  schema  is  the  number  of  violations  of 
placement  or  cluster  constraints  that  may  be  made  during  the 
labelling  process.  Each  of  these  violations  can  be  resolved 
if  we  allow  replication  of  objects.  Assume  that  the  schema 
is  labelled  as  in  (i).  A  conflict  is  a  violation  of  a  clus¬ 
ter  or  placement  constraint.  A  cluster  conflict  can  be 
resolved  by  one  of  the  following  methods: 


In  (A),  a  copy  of  the  domain  object  is  made,  and  both  the 
original  and  the  copy  are  clustered  on  the  appropriate 
ranges.  In  (B)  ,  a  copy  of  the  range  is  made  and  placed  in 
one-to-one  correspondence  with  the  original  domain  object. 
To  illustrate  this,  consider  the  entity  set  employees  and 
the  value  set  integers,  interrelated  by  the  property  func- 
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tion  age.  Schematically,  the  following  situation  can  arise: 


The  effect  of  type  (B)  cluster  resolution  is  to  replicate 
the  age  values  so  there  is  one  age  value  per  employee: 


A  placement  conflict  is  resolved  in  an  analogous  way  to 
cluster  conflicts: 


The  degree  of  a  schema  is  a  measure  of  the  amount  of 
replication  we  are  willing  to  tolerate  during  the  labelling 
process.  Replicated  information  introduces  increased  costs 
for  storage  and  update  while  reducing  retrieval  costs.  A 
degree  of  0  insures  that  no  replication  will  result,  i.e., 
the  cluster  and  placement  constraints  are  never  violated;  a 
degree  of  n  >  0  will  allow  up  to  n  replicated  objects  to  be 
created . 
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A  maximal  labelling  is  a  labelling  L  for  which  there 
exists  no  labelling  L'  such  that  L'  >  L.  Because  not  all 
labellings  are  comparable,  it  is  possible  to  generate  many 
maximal  labellings  for  the  same  schema.  Rather  than  gen¬ 
erate  all  the  possible  maximal  labellings  for  a  given 
schema,  usage  information  can  be  use  to  restrict  the 
enumeration  to  those  that  best  support  the  expected  usage 
patterns  of  the  database. 

4.2  A  Labelling  Algorithm 

In  this  subsection,  we  present  an  algorithm  for  gen¬ 
erating  a  maximal  labelling  that  specifies  superior  support 
for  the  access  paths  most  heavily  travelled.  Assume  that  the 
degree  of  replication  is  n.  This  means  that  up  to  n  place¬ 
ment  or  cluster  conflicts  will  be  tolerated  while  labelling 
the  schema.  These  conflicts  will  later  be  resolved  using  the 
techniques  of  the  previous  subsection. 

The  input  to  the  algor itnm  is  a  schema  to  be  labelled 
and  a  ranking  of  the  edges  (access  mappings)  according  to 
frequency  of  traversal.  The  algorithm  only  enforces  cluster 
and  placement  constraints.  Initially  all  edges  are  labelled 
“I".  We  begin  by  assigning  the  next  most  favorable  label 
(“C" )  to  the  heaviest  used  edge.  We  continue  assigning 
labels  in  this  manner  until  either  n  cluster  conflicts  have 
been  detected  or  all  edges  have  been  examined.  Then  we 
assign  the  most  favorable  label  ("W )  to  the  most  heavily 
used  edge  that  is  already  labelled  "C".  We  continue  until  a 


-j  \i.  -g- 


-17- 


Access  Patn  Model 


Katz  &  Wong 


total  of  n  cluster  or  placement  conflicts  are  detected.  The 

^  u 

edge  that  causes  the  n+l*"  conflict  is  not  relabelled.  The 
algorithm  to  assign  labels  is: 

♦conflicts  <-  0 

for  each  edge  do  label  edge  ”1” 
for  each  edge  (in  freauency  of  access  order)  do 
I"  label  edge  “C" 

I  if  cluster  conflict  then 

I  if  fconflicts  =  n  then  relabel  edge  "I" 

L  else  Iconflicts  <-  #conflicts  +  1 
for  each  edge  labelled  "C”  (in  frequency  order)  do 
f”  label  edge  “W" 

I  if  placement  conflict  then 
I  if  Iconflicts  =  n  then  relabel  edge  "C" 

I  else  Iconflicts  <-  Iconflicts  +  1 

When  all  edges  have  been  assigned  a  label,  resolution  is 
performed  for  each  vertex  which  does  not  meet  the  placement 
and  cluster  constraints.  Type  (A)  placement  resolution  is 
chosen  for  conflicts  involving  edges  between  non-value 
objects  (associations)  and  type  (B)  for  conflicts  involving 
edges  between  non-value  and  value  objects  (properties) . 

The  algorithm  can  be  illustrated  with  an  example.  Con¬ 
sider  the  sample  schema: 

S(SNO,SNAME) 

P  (PNO , PNAME ) 

SP (S , P ,QTY) 


Consider  the  following  ranking  of  access  mappings,  from  most 
to  least  heavily  used. 


1)  S-SP 

2)  SNO 

3)  P-SP 
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4)  QTY 

5)  PNO 

6)  PNAME 

7)  SNAME 

This  ranking  could  have  been  derived  from  a  set  of  user 

queries  in  conjunction  with  an  indication  of  relative  fre¬ 
quency,  or  simply  specified  by  the  designer.  For  a  degree  of 
replication  =  1,  the  algorithm  proceeds  as  follows: 
initial  labelling :  all  edges  labelled  "I" 

C-labelling: 

STEP  1:  label  S-SP  with  MC" 

STEP  2:  label  SNO  with  MC“ 

STEP  3:  P-SP  can  not  be  labelled  “C"  without  a  conflict. 

Label  it  MC“ .  No  additional  conflicts  are  allowed. 

STEP  4:  QTY  can  not  be  labelled  "C"  without  a  conflict. 

STEP  5:  label  PNO  with  “C" 

STEP  6:  label  PNAME  with  "CM  (does  not  conflict  with  PNO) 

STEP  7:  label  SNAME  with  "C"  (does  not  conflict  with  SNO) 

W-labell ing :  all  edges  labelled  “C"  can  be  labelled  "W" 
without  conflict. 

The  resulting  labelling  is: 


*W 

4t 

•V* 

Placement  resolution  must  be  performed  for  SP.  The  more  fre¬ 
quently  used  edge  will  eminate  from  the  original  SP  while 
the  less  frequently  used  one  will  eminate  from  the  repli¬ 
cated  SP'.  Type  (A)  resolution  is  used  because  the 


-19- 


Access  Path  Model 


Katz  &  Wong 


conflicting  edge,  P-SP,  involves  non-value  vertices: 


We  note  that  fully  constrained  labelling  can  be  formu¬ 
lated  in  terms  of  an  integer  linear  program  whose  objective 
function  seeks  to  maximize  the  sum  of  the  frequencies  of  the 
edges  labelled  "W"  and  "C" .  Further  details  can  be  found  in 
[KATZ 8ld ]  . 

5.  Implementing  a  Schema 

Up  to  this  point,  the  design  has  been  independent  of 
the  actual  data  model  and  system.  In  this  section  we 
briefly  discuss  the  considerations  involved  in  mapping  a 
labelled  schema  into  DBTG  storage  structures. 

The  quality  of  the  mapping  depends  on  the  detail  of 
usage  information  specified.  In  the  following,  we  assume 
that  information  has  been  specified  at  the  level  of  the  pre¬ 
vious  section.  All  property  mappings  are  *'evaluatedM  sup¬ 
ported  by  placing  the  range  value  in  the  record  that 
represents  the  entity  or  relationship  instance. 

In  the  new  CODASYL  proposal  [CODA78] ,  many  aspects  of 
the  physical  database  design  have  been  removed  from  the 
schema  DDL  and  localized  in  data  storage  definition.  The 
DSDL  provides  facilities  for  the  specification  of  the 
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pagination  of  the  storage  media,  schema  to  storage  record 
mapping,  record  pointer  implementation,  set  representation, 
and  storage  record  placement.  We  do  not  deal  with  the 
specification  of  the  storage  media,  and  assume  that  all  sets 
are  represented  by  chains  with  direct  pointers.  Additional 
usage  information  could  be  used  to  make  a  more  sophisticated 
choice  for  these  parameters. 

The  DSDL  provides  three  choices  for  the  record  place¬ 
ment  strategy.  A  record  may  be  calc'd  (hashed)  on  a  key 
specified  in  the  DDL,  clustered  by  set  membership  and 
optionally  placed  near  the  owner,  or  stored  in  sequential 
sorted  order.  Indexes  can  be  specified  separately  for  keys 
specified  in  the  DDL. 

At  most  one  non-l-to-1  outedge  of  a  node  can  be 
labelled  "W"  or  "C".  Because  identifier  outedges  do  not 
exhibit  useful  clustering  properties,  they  are  ignored.  The 
other  clustered  or  well  placed  outedge  determines  the  record 
type's  primary  storage  structure.  If  the  outedge  represents 
a  property,  then  the  record  type  is  stored  sequentially  and 
sorted  and  indexed  on  the  appropriate  data  item.  Calc'd 
could  be  chosen,  but  more  detailed  usage  information  would 
be  needed  to  determine  whether  access  is  primarily  on  equal¬ 
ity  rather  than  range.  In  the  absence  of  such  information  we 
always  choose  indexed  structures.  If  the  outedge  represents 
an  association,  then  the  record  type  is  clustered  on  the 
associated  set,  and  if  "W"  is  specified,  the  records  are 
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placed  near  the  owner.  The  following  rules  can  be  used  to 
determine  the  record  type's  structure: 

for  each  record  type  (node) : 

(1)  for  each  outedge  labelled  "W"  or  "C" , 

*  if  the  outedge  is  a  property,  then  structure  the  record 

type  on  the  indicated  data  item 

*  else  the  outedge  is  an  association;  then  cluster  the 

record  type  on  the  indicated  set  and  if  labelled  "W" , 
place  near  the  owner 

(2)  for  each  outedge  labelled  "I"  or  an  identifier 
outedge, 

*  index  the  record  type  on  the  associated  data  item 

The  DSDL  also  provides  facilities  to  allow  a  single 
schema  record  to  be  represented  by  multiple  stored  records. 
This  corresponds  closely  to  our  formulation  of  replication. 
Consider  the  following  degree  1  labelling  and  its  associated 
CODASYL  schema: 

record  types  sets 

STSHOTsHamET  S-SP,  Owner  S,  Member  SP 

P ( PNO , PNAME )  P-SP,  Owner  P,  Member  SP 

SP(QTY) 


The  DSDL  specification  for  the  schema  would  be: 

MAPPING  FOR  S 

STORAGE  RECORD  IS  S 

'  MAPPING  FOR  P 
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STORAGE  RECORD  IS  P 

MAPPING  FOR  SP 

STORAGE  RECORDS  ARE  SP,SP' 

STORAGE  RECORD  NAME  IS  S 

PLACEMENT  IS  SEQUENTIAL  ASCENDING  SNAME 
SET  S-SP  ALLOCATION  IS  STATIC 

POINTER  FOR  FIRST,  LAST  RECORD  SP 
IS  TO  SP 

STORAGE  RECORD  NAME  IS  P 

PLACEMENT  IS  SEQUENTIAL  ASCENDING  PNAME 
SET  P-SP  ALLOCATION  IS  STATIC 

POINTER  FOR  FIRST,  LAST  RECORD  SP 
IS  TO  SP' 

STORAGE  RECORD  NAME  IS  SP 
LINK  TO  SP' 

PLACEMENT  IS  CLUSTERED  VIA  SET  S-SP  NEAR  OWNER  S 
SET  S-SP  ALLOCATION  IS  STATIC 
POINTER  FOR  NEXT,  PRIOR 
POINTER  FOR  OWNER 

STORAGE  RECORD  NAME  IS  SP' 

LINK  TO  SP 

PLACEMENT  IS  CLUSTERED  VIA  SET  P-SP  NEAR  OWNER  P 
SET  P-SP  ALLOCATION  IS  STATIC 
POINTER  FOR  NEXT,  PRIOR 
POINTER  FOR  OWNER 


plus  specification  for  INDEXES  for  each  data  item  not 


covered  in  the  above.  The  access  schema  for  the  above  is: 


All  access  mappings  are  maximally  supported.  Usage  informa¬ 
tion  may  indicate  that  certain  paths  are  not  worth  the  over¬ 
head  of  supporting  them. 
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6.  Conclusions  and  Future  Work 

In  this  paper  we  have  proposed  an  access  path  model  for 
physical  database  design  as  an  extension  of  our  original 
work  with  a  semantic  model  for  logical  database  design  and 
schema  conversion.  The  properties  of  access  paths  were  dis¬ 
cussed  and  a  methodology  which  generates  maximally  supported 
schemas  was  proposed  and  illustrated  with  examples.  We 
oelieve  that  this  approach  to  qualitative  physical  design  is 
new  and  unique. 

We  have  briefly  discussed  the  applications  of  our 
methodology  for  designing  CODASYL  physical  databases.  More 
work  is  required  on  usage  specification  in  order  to  improve 
the  quality  of  the  design. 

The  access  path  model  also  has  applications  to  problems 
of  program  translation.  A  generalized  query  processing  algo¬ 
rithm  can  be  formulated  to  ,,compile,,  non-procedural  queries, 
e.g.,  relational  calculus,  into  the  access  paths  supported 
in  tne  access  schema.  Primitive  operations  on  the  access 
schema  can  be  defined  in  a  way  that  facilities  implementing 
these  operations  in  terms  of  CODASYL  DML.  In  addition,  we 
have  been  investigating  how  to  reverse  the  process,  i.e., 
"decompil ing“  programs  that  access  data  at  the  level  of  DML 
into  non-procedural  queries,  with  the  aid  of  the  access 
schema.  These  problems  are  further  explored  in  [ KATZ 80 ] . 
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